Describe the work you have done this week and summarize your learning.
date()
## [1] "Sun Nov 27 23:27:42 2022"
Here we go again…
Note! You might need to install some packages first if you haven’t used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot"))
Load the needed R packages before getting to work:
#load required packages
library(tidyverse)
library(GGally)
library(dplyr)
library(ggplot2)
library(MASS)
library(corrplot)
# load the 'Boston' data
data("Boston")
# explore the dataset
str(Boston);dim(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506 14
The ‘Boston’ dataset from the MASS library is a data frame of Housing Values in Suburbs of Boston. It has 506 observations (rows) and 14 variables (columns).
The variables are:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk − 0.63)^2, where Bk is the proportion of the Black population by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s

The R documentation of the dataset can be found with ?Boston (MASS package).
# summary of the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# data from wide to long
long_Boston <- pivot_longer(Boston, cols=everything(), names_to = 'variable', values_to = 'value')
# Histograms of all variables in use
p1 <- ggplot(data=long_Boston)
p1 + geom_histogram(mapping= aes(x=value)) +
facet_wrap(~variable, scales="free")
# Summary plot matrix with correlation and density plots
p2 <- ggpairs(Boston, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2
All variables of the ‘Boston’ dataset are numeric. Most of the variables are skewed, so we’ll use Spearman’s correlation for the correlation assessment later. Several variables have possible outliers. The correlations are difficult to read from this plot with this many variables, and ggpairs() shows Pearson’s correlations by default, so we’ll compute a Spearman correlation matrix with cor() (from base R) and visualize it with the ‘corrplot’ library, focusing on the correlations between variables.
# calculate the correlation matrix
cor_matrix <- cor(Boston, method='spearman')
# print the correlation matrix
cor_matrix %>% round(1)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## crim 1.0 -0.6 0.7 0.0 0.8 -0.3 0.7 -0.7 0.7 0.7 0.5 -0.4 0.6
## zn -0.6 1.0 -0.6 0.0 -0.6 0.4 -0.5 0.6 -0.3 -0.4 -0.4 0.2 -0.5
## indus 0.7 -0.6 1.0 0.1 0.8 -0.4 0.7 -0.8 0.5 0.7 0.4 -0.3 0.6
## chas 0.0 0.0 0.1 1.0 0.1 0.1 0.1 -0.1 0.0 0.0 -0.1 0.0 -0.1
## nox 0.8 -0.6 0.8 0.1 1.0 -0.3 0.8 -0.9 0.6 0.6 0.4 -0.3 0.6
## rm -0.3 0.4 -0.4 0.1 -0.3 1.0 -0.3 0.3 -0.1 -0.3 -0.3 0.1 -0.6
## age 0.7 -0.5 0.7 0.1 0.8 -0.3 1.0 -0.8 0.4 0.5 0.4 -0.2 0.7
## dis -0.7 0.6 -0.8 -0.1 -0.9 0.3 -0.8 1.0 -0.5 -0.6 -0.3 0.2 -0.6
## rad 0.7 -0.3 0.5 0.0 0.6 -0.1 0.4 -0.5 1.0 0.7 0.3 -0.3 0.4
## tax 0.7 -0.4 0.7 0.0 0.6 -0.3 0.5 -0.6 0.7 1.0 0.5 -0.3 0.5
## ptratio 0.5 -0.4 0.4 -0.1 0.4 -0.3 0.4 -0.3 0.3 0.5 1.0 -0.1 0.5
## black -0.4 0.2 -0.3 0.0 -0.3 0.1 -0.2 0.2 -0.3 -0.3 -0.1 1.0 -0.2
## lstat 0.6 -0.5 0.6 -0.1 0.6 -0.6 0.7 -0.6 0.4 0.5 0.5 -0.2 1.0
## medv -0.6 0.4 -0.6 0.1 -0.6 0.6 -0.5 0.4 -0.3 -0.6 -0.6 0.2 -0.9
## medv
## crim -0.6
## zn 0.4
## indus -0.6
## chas 0.1
## nox -0.6
## rm 0.6
## age -0.5
## dis 0.4
## rad -0.3
## tax -0.6
## ptratio -0.6
## black 0.2
## lstat -0.9
## medv 1.0
# visualize the correlation matrix
# cl.pos= position of the color legend, e.g. cl.pos = 'b'
# tl.pos= position of the text labels
# tl.cex = size of text labels
# type = type of plot, 'full', 'upper', 'lower'
# method = the visualization of the correlation coefficients, 'circle', 'square', 'pie', 'number', etc.
# Basic one-style correlation plot: corrplot(cor_matrix, method="ellipse", tl.pos ="d", tl.cex=0.6)
# Cool mixed correlation plot:
corrplot.mixed(cor_matrix, lower = 'number', upper = 'ellipse',tl.pos ="lt", tl.col='black', tl.cex=0.6, number.cex = 0.5)
The darker the color (see the color legend, blue or red) and the more straight-line-like the ellipse, the larger the correlation coefficient between variables. The exact coefficients are shown in the lower triangle of the plot. The slope and the color (blue vs. red) of the ellipses depict the direction (positive vs. negative) of the correlation. Unfortunately, cor() does not provide p-values for the correlations (cor.test() would, one pair at a time).
There are some strong negative correlations (more of A goes with less of B), for example:

- dis and nox (−0.9)
- lstat and medv (−0.9)
- dis and indus, dis and age (−0.8)

There are some strong positive correlations (more of A goes with more of B) too, for example:

- crim and nox (0.8)
- indus and nox (0.8)
- nox and age (0.8)
We’ll now standardize the data by scaling it. Boston data has only numerical variables, so we’ll be able to standardize them all in one go with scale().
Quoting our course material: “In the scaling we subtract the column means from the corresponding columns and divide the difference with standard deviation.”
\[scaled(x) = \frac{x - mean(x)}{ sd(x)}\]
# center and standardize variables with scale()
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to a data frame for further analysis
boston_scaled <- as.data.frame(boston_scaled)
After scaling, the mean of every variable is zero. Standardizing variables that were originally on different scales makes them comparable and enables the later analyses. The scale() function returns a matrix, so for further analysis we converted it back to a data frame.
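As a quick sanity check, scale() can be compared against the manual formula on a toy vector (the numbers below are hypothetical, not from the Boston data):

```r
# Minimal sketch: verify that scale() matches (x - mean(x)) / sd(x)
# on a toy vector (hypothetical values).
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
via_scale <- as.numeric(scale(x))
all.equal(manual, via_scale)  # TRUE
```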
Let’s replace the continuous crime rate variable with a categorical variable with classes low, med_low, med_high and high, using the quantiles of the scaled crime rate as break points.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
#check new dataset
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
We’ll now divide the dataset into a training set (80 %) and a test set (20 %) for model fitting and testing.
# number of rows in the scaled Boston dataset
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set (random 80% of the scaled Boston data)
train <- boston_scaled[ind,]
# create test set (random 20% of the scaled Boston data)
test <- boston_scaled[-ind,]
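Note that sample() draws a different split on every run, so the downstream results (e.g. the confusion matrix later on) vary between knits; a set.seed() call just before the split pins it down. A minimal sketch (the seed value 2022 is an arbitrary choice, not used in the analysis above):

```r
# Sketch: a reproducible 80/20 split (seed value is arbitrary; the
# analysis above did not set a seed, so its split differs between runs).
set.seed(2022)
n <- 506
ind <- sample(n, size = floor(n * 0.8))
length(ind)  # 404 rows go to the training set
```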
We’ll fit the linear discriminant analysis (LDA), a classification (and dimension reduction) method, on the training set.
# linear discriminant analysis on the training set
# target variable = crime
# all other variables (.) are the predictor variables
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2301980 0.2475248 0.2722772 0.2500000
##
## Group means:
## zn indus chas nox rm age
## low 0.93830934 -0.9067066 -0.18766025 -0.8911641 0.47088160 -0.8728044
## med_low -0.07947862 -0.3135085 0.04263895 -0.6039650 -0.14271554 -0.4456984
## med_high -0.38433506 0.2017579 0.22875642 0.4312968 0.09461014 0.4185091
## high -0.48724019 1.0171306 -0.11640431 1.0386336 -0.37957200 0.7945154
## dis rad tax ptratio black lstat
## low 0.8942333 -0.6904323 -0.7134165 -0.48556996 0.37627045 -0.7587121
## med_low 0.4265047 -0.5500476 -0.4585171 -0.07276562 0.30981226 -0.1502386
## med_high -0.3847551 -0.3940649 -0.2802775 -0.28725768 0.08109453 0.0325175
## high -0.8477995 1.6379981 1.5139626 0.78062517 -0.80297395 0.8703820
## medv
## low 0.52523430
## med_low -0.01106939
## med_high 0.15923125
## high -0.70288315
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0721346701 0.65045000 -1.08342110
## indus 0.0007975755 -0.27999191 0.23351026
## chas -0.0800368252 -0.11333339 0.20966697
## nox 0.3796180585 -0.82836201 -1.08803628
## rm -0.1017762621 -0.04623747 -0.17310567
## age 0.2810826210 -0.31443619 -0.26475188
## dis -0.0687862542 -0.25017385 0.27597201
## rad 3.0578083430 0.73554890 -0.19118712
## tax -0.0850179074 0.27819728 0.64503886
## ptratio 0.1321286554 -0.08314339 -0.29786151
## black -0.1351019887 0.02526919 0.09348762
## lstat 0.1775301512 -0.15329041 0.35536849
## medv 0.1680846002 -0.41081521 -0.20097225
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9397 0.0447 0.0156
There are three discriminant functions (LD1, LD2, LD3). Let’s biplot the results of these functions:
# the function for lda biplot arrows (variables)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results, all dimensions
plot(lda.fit, dimen = 3, col = classes, pch = classes)
# plot LD1 vs. LD2
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
LD1 and LD2 are the strongest functions for separating the observations on the target variable, as can also be seen from the proportion of trace in the LDA model output and the graphical overview of all the dimensions LD1 through LD3.
The graphical overview of our LDA model shows that LD1 and LD2 clearly separate one group of observations: the high per capita crime rate class (of our target variable, crime) is separated from the lower quartiles. Crime rate seems to be higher when the index of accessibility to radial highways (rad) is higher. Other variables don’t draw as much attention.
To predict crime rate categories with our LDA model, we first save the true crime categories from the test set, then remove them from the test data, and finally predict the classes with the fitted model.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable (true/correct crime rate classes) from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross-tabulate the results and add margin sums
tbres1 <- table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
tbres1
## predicted
## correct low med_low med_high high Sum
## low 18 14 2 0 34
## med_low 2 14 10 0 26
## med_high 0 3 13 0 16
## high 0 0 0 26 26
## Sum 20 31 25 26 102
#number of rows in test set
nrow(test)
## [1] 102
Our model seems to predict crime rates quite nicely: 71 of the 102 test observations (about 70 %) are classified correctly. The ‘high’ class is predicted perfectly (26/26), and most misclassifications fall into a neighbouring category.
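The overall accuracy can be computed directly from the confusion matrix; a sketch with the counts copied by hand from the cross-tabulation above (a new random split would give slightly different numbers):

```r
# Sketch: overall accuracy from the confusion matrix above
# (counts copied by hand; they vary with the random split).
conf <- matrix(c(18, 14,  2,  0,
                  2, 14, 10,  0,
                  0,  3, 13,  0,
                  0,  0,  0, 26),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # ~0.696, i.e. about 70 % correct
```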
We want to investigate the similarity between our observations, so we’ll create clusters of similar data points by running the k-means algorithm on the dataset.
We’ll first standardize the data for further analysis.
# Reload the 'Boston' data
data("Boston")
# Recenter and restandardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to a data frame for further analysis
boston_scaled <- as.data.frame(boston_scaled)
As instructed in our assignment, we’ll calculate the distances between observations to assess the similarity between our data points. We’ll calculate both Euclidean and Manhattan distances. Note that the k-means algorithm itself uses Euclidean distances.
# Calculate distances between the observations.
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
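The two metrics can be illustrated on a pair of toy points (hypothetical coordinates, not from the Boston data):

```r
# Sketch: Euclidean vs. Manhattan distance between two toy points.
a <- c(0, 0)
b <- c(3, 4)
sqrt(sum((a - b)^2))  # Euclidean: 5
sum(abs(a - b))       # Manhattan: 7
# dist() gives the same results:
dist(rbind(a, b))                        # 5
dist(rbind(a, b), method = "manhattan")  # 7
```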
## Run k-means algorithm on the dataset
## k-means clustering
# K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
set.seed(123)
km <- kmeans(boston_scaled, centers = 4)
# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[6:10], col = km$cluster)
With k-means clustering with four clusters there seems to be some overlap between the clusters (based on the plots). Let’s investigate the optimal number of clusters for our model with the total within-cluster sum of squares (WCSS) and how it changes as the number of clusters changes.
# Investigate what is the optimal number of clusters and run the algorithm again
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line', ylab='Total within sum of squares (WCSS)', xlab='number of clusters')
The optimal number of clusters is where the total WCSS drops sharply. Here the optimal number of clusters for our model seems to be 2. Let’s run the algorithm again with 2 clusters and visualize the results.
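The visual elbow can also be checked numerically. A toy sketch on synthetic data with two well-separated clusters (all values hypothetical, not from the Boston data): the largest drop in total WCSS between consecutive k marks the elbow.

```r
# Sketch: numeric "elbow" check on toy data with two obvious clusters.
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
twcss_toy <- sapply(1:6, function(k) kmeans(toy, k, nstart = 10)$tot.withinss)
# the biggest drop happens when moving to the "true" number of clusters:
which.max(-diff(twcss_toy)) + 1  # 2
```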
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# Visualize the clusters
# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[1:5], col = km$cluster)
pairs(boston_scaled[6:10], col = km$cluster)
pairs(boston_scaled[11:14], col = km$cluster)
We determined that grouping our data with k-means is best done with two cluster centroids (i.e., two clusters). However, we do not know what these clusters actually represent.
From the plots we can see that there are very clearly separated clusters at least within rad and tax.
To determine possible reasons for the clustering result we could draw a parallel coordinates plot.
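A minimal sketch of such a plot with MASS::parcoord(), rebuilt from scratch so it is self-contained (the fixed seed is my choice for reproducibility, so the cluster labels may differ from the run above):

```r
# Sketch: parallel coordinates plot colored by a 2-cluster k-means result.
library(MASS)
bs <- as.data.frame(scale(Boston))
set.seed(123)
km2 <- kmeans(bs, centers = 2)
MASS::parcoord(bs, col = km2$cluster)
```

Lines of the same color then trace observations of one cluster across all 14 variables, which makes variables like rad and tax stand out.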
# Reload the 'Boston' data
data("Boston")
# Recenter and restandardize variables
boston_scaled_bonus <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled_bonus)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled_bonus)
## [1] "matrix" "array"
# change the object to a data frame for further analysis
boston_scaled_bonus <- as.data.frame(boston_scaled_bonus)
Let’s run the k-means algorithm with an arbitrarily chosen four clusters.
## Run k-means algorithm with 4 cluster centers on the dataset
## k-means clustering
# K-means might produce different results on every run, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that (already used in a previous chunk of code in this document)
km_b <- kmeans(boston_scaled_bonus, centers = 4)
# plot the scaled Boston dataset with clusters
#pairs(Boston, col = km$cluster)
#pairs(Boston[6:10], col = km$cluster)
#### Perform LDA using the clusters as target classes
# assess the kmeans() result summary
summary(km_b) #it is not a dataframe
## Length Class Mode
## cluster 506 -none- numeric
## centers 56 -none- numeric
## totss 1 -none- numeric
## withinss 4 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 4 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
typeof(km_b) #it is a 'list' object
## [1] "list"
str(km_b)
## List of 9
## $ cluster : Named int [1:506] 1 1 1 2 2 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
## $ centers : num [1:4, 1:14] -0.377 -0.408 0.859 -0.199 -0.333 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "1" "2" "3" "4"
## .. ..$ : chr [1:14] "crim" "zn" "indus" "chas" ...
## $ totss : num 7070
## $ withinss : num [1:4] 1102 623 1380 341
## $ tot.withinss: num 3446
## $ betweenss : num 3624
## $ size : int [1:4] 222 98 152 34
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# --> km_b is a 'kmeans' object: a list of components describing the clustering of the 506 observations.
# More detailed info:
#?kmeans()
# subset 'cluster' from km (list) and add it as a column to our 'boston_scaled' dataframe
km_b_clust <- km_b$cluster
km_b_clust <- as.data.frame(km_b_clust)
boston_scaled_bonus <- mutate(boston_scaled_bonus, km_b_clust)
# Run linear discriminant analysis with the clusters as target classes and all other variables of the data set as explanatory variables
lda.fit_bonus <- lda(km_b_clust ~ ., data = boston_scaled_bonus)
# print the lda.fit object
lda.fit_bonus
## Call:
## lda(km_b_clust ~ ., data = boston_scaled_bonus)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.43873518 0.19367589 0.30039526 0.06719368
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3773550 -0.3325348 -0.3302399 -0.2723291 -0.3380896 -0.07661724
## 2 -0.4083186 1.5993013 -1.0868786 -0.2321546 -1.0252997 0.85762197
## 3 0.8588075 -0.4872402 1.1204442 -0.2723291 1.0691487 -0.50270159
## 4 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.27566811
## age dis rad tax ptratio black
## 1 -0.06014166 0.07356985 -0.588184980 -0.6041027 -0.06143853 0.2840492
## 2 -1.22890399 1.28526187 -0.598658222 -0.6419128 -0.74348995 0.3532213
## 3 0.79691805 -0.84588600 1.244794751 1.3179961 0.65687472 -0.6809675
## 4 0.37213224 -0.40333817 0.001081444 -0.0975633 -0.39245849 0.1715427
## lstat medv
## 1 -0.2086903 0.03252933
## 2 -0.9364928 0.91875080
## 3 0.9453522 -0.76810974
## 4 -0.1643525 0.57334091
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.034836228 0.01232052 0.13192586
## zn -0.267361320 0.15788045 1.20991564
## indus 0.020533059 -0.58778276 0.21069583
## chas 5.828653032 0.14949639 0.09257929
## nox 0.026703754 -0.34585368 0.54345497
## rm -0.068843435 0.07156541 0.37000239
## age 0.098415418 -0.04458773 -0.53934235
## dis 0.053100700 0.23565203 0.47758361
## rad 0.064174139 -0.72837621 0.11551231
## tax 0.033261463 -0.60257854 0.70761129
## ptratio -0.007773957 -0.10230899 0.07726420
## black 0.015155025 0.03258028 -0.02842152
## lstat -0.133125916 -0.26354444 0.52415255
## medv -0.146715844 0.14530719 0.54132868
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8121 0.1472 0.0407
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes / clusters as numeric
classes_b <- as.numeric(boston_scaled_bonus$km_b_clust)
# plot the lda results
plot(lda.fit_bonus, col = classes_b, pch = classes_b)
plot(lda.fit_bonus, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows(lda.fit_bonus, myscale = 1)
plot(lda.fit_bonus, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows(lda.fit_bonus, myscale = 9)
When clustering the Boston data into 4 clusters with the k-means algorithm, the most influential linear separators seem to be chas (by far the largest coefficient on LD1), rad, tax and indus on LD2, and zn on LD3.
The results are somewhat different from those with the 2-cluster solution. Looking at the full matrix of biplots (LD1, LD2, LD3), we could suspect there are either two or five different clusters. From our earlier analysis we know, though, that two clusters seem to be the best clustering option.
Note! The results changed after re-running the R code. To answer our assignment more generally, the three most influential variables are the ones with the longest arrows in the LD1 vs. LD2 plot. For easier assessment of the arrows, the second LD1 vs. LD2 plot has scaled arrows (see the myscale argument).
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# install.packages('plotly')
library('plotly')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
We drew a very cool dynamic 3D plot following the instructions of our course assignment! Let’s customize it:
# add colors
# set the color to be the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes) %>% layout(title = 'Plot A')
To set the colors to be the clusters of the kmeans, we’ll join data with left_join() of dplyr:
# set the color to be the clusters of the kmeans (2 clusters)
# subset 'cluster' from km (list) and add it as a column to our 'boston_scaled' dataframe
km_sb_clust <- km$cluster
km_sb_clust <- as.data.frame(km_sb_clust)
boston_scaled_sb <- mutate(boston_scaled, km_sb_clust)
# merge data
train_sb <- train %>% left_join(boston_scaled_sb)
# check data
head(train_sb);dim(train_sb)
## zn indus chas nox rm age
## 1 -0.4872402 -1.2647715 -0.2723291 -0.57556435 2.10693068 0.5231153
## 2 -0.4872402 1.0149946 -0.2723291 1.36613839 0.02329236 0.5373254
## 3 2.9429307 -1.3668070 -0.2723291 -1.46443272 -0.07775840 -1.3171013
## 4 -0.4872402 1.0149946 -0.2723291 0.51178918 -0.25851118 0.5870610
## 5 -0.4872402 -0.1802792 -0.2723291 -0.09229612 -0.23146943 -0.5604099
## 6 0.4131797 -0.8012385 -0.2723291 -0.99842406 -0.45776621 -0.8126404
## dis rad tax ptratio black lstat
## 1 -0.5005640 -0.7521778 -1.2770905 -0.30279450 0.4259382 -0.7132081
## 2 -0.4805707 1.6596029 1.5294129 0.80577843 -0.9251783 0.5008971
## 3 2.5141909 -0.9818712 -0.9922868 -0.11803234 -0.1651137 0.0387809
## 4 -0.8421115 1.6596029 1.5294129 0.80577843 -3.8792328 1.4895456
## 5 -0.5483863 -0.6373311 -0.6184819 -0.02565127 0.4406159 -0.9344638
## 6 1.4340328 -0.6373311 -0.9804200 -0.76469989 0.4259382 0.1115992
## medv crime crim km_sb_clust
## 1 1.87745985 low -0.4124426 2
## 2 -0.82991410 high 0.4811585 1
## 3 -0.26451873 low -0.4178172 2
## 4 -0.99300891 high 0.3995673 1
## 5 -0.04705898 med_high -0.3282101 2
## 6 -0.30801068 med_low -0.4097861 2
## [1] 404 16
# target classes / clusters as numeric
classes_sb <- as.numeric(train_sb$km_sb_clust)
# Draw the 3D plot
model_predictors_sb <- dplyr::select(train_sb, -crime, -crim, -km_sb_clust)
# check the dimensions
dim(model_predictors_sb)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit$scaling
matrix_product_sb <- as.data.frame(matrix_product_sb)
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# install.packages('plotly')
library('plotly')
plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = classes_sb) %>% layout(title = 'Plot B')
Plot A shows four colors depicting our four crime rate quartiles: ‘low = 1’, ‘med_low = 2’, ‘med_high = 3’, ‘high = 4’. Plot B shows the coloring by our optimal two clusters. Comparing plots A and B, we notice that the dark blue cluster of plot B includes some observations from the ‘med_low’ crime rate class, while the yellow cluster of plot B seems to include observations from the low, med_low and med_high classes. Some med_high observations have also been assigned to the dark blue cluster. In summary, the clustering separates the high crime rate suburbs into their own cluster quite nicely.
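The visual comparison could also be quantified with a cross-tabulation of the crime classes against the cluster assignments. A self-contained sketch, rebuilt from scratch with a fixed seed (so the exact counts are illustrative, not the ones behind plots A and B):

```r
# Sketch: crime rate classes vs. a 2-cluster k-means assignment
# (rebuilt from scratch; seed chosen for reproducibility).
library(MASS)
bs <- as.data.frame(scale(Boston))
crime <- cut(bs$crim, breaks = quantile(bs$crim), include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
set.seed(123)
km2 <- kmeans(bs, centers = 2)
table(crime, cluster = km2$cluster)
```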